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Abstract 

In this paper, we report fast calculation of a computer-generated- 
hologram using a new architecture of the HD5000 series GPU (RV870) 
made by AMD and its new software development environment, OpenCL. 
Using a RV870 GPU and OpenCL, we can calculate 1, 920 x 1, 024 res- 
olution of a CGH from a 3D object consisting of 1,024 points in 30 
milli-seconds. The calculation speed realizes a speed approximately 
two times faster than that of a GPU made by NVIDIA. 



1 Introduction 

CGH (Computer-generated-hologram) has the ability to correctly record and 
reconstruct a light wave for a 3D object. Electroholography[l | using the 
CGH technique is attractive as a 3D display, because the CGH technique 
has remarkable features ; however, due to two significant problems, it is 
difficult to develop a practical 3D display system using electroholography. 
One problem is the need for an SLM (spatial light modulator) that can 
display a CGH with large area and high resolution, because the resolution 
of a CGH is that of wavelength-order [2j HI [5]. The other problem is the 
enormous computational time required for generating a CGH. This paper 
focuses on this problem. 

Assuming that a 3D object is composed of N point light sources, the 
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formula for computing a CGH is expressed as: 

VH) = EVos(^( (PXfe "^ )2 9 +(m "^ )2 )) 

3 J 
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where, I(xh,yh) is the light intensity on a CGH, (x^^h) an d (xj,yj,zj) are 
the coordinates for the CGH and a 3D object, Aj is the light intensity of the 
3D object, A is the wavelength of the reference light, and Pj = irp 2 /(\zj), 
where p is the sampling interval on the CGH plane. Note that the coordi- 
nates (xj,yj) and (xh,yh) are normalized by p. The computational complex- 
ity of the above formula is 0(NN x N y ), where N x and N y are the horizontal 
and vertical sampling numbers of the CGH. This creates very enormous 
computational complexity. 

To solve this problem, several software approaches have been proposed: 
for example, recurrence approaches 0(71 [5], and the look-up table methods 
[TOl lllj . Another approach to dramatically reduce the calculation time is 
the hardware approach, such as FPGA (field-programmable gate array) and 
GPU (graphics processing unit). We have designed and built special-purpose 
computers for holography using FPGA technology, called HORN (HOlo- 
graphic ReconstructioN) [TJl [131 [TJ] . The FPGA-based approaches showed 
excellent computational speed, however, the approach has the following re- 
strictions: the high cost for developing the FPGA board, long development 
time and technical know-how required for the FPGA technology. 

GPU-based approaches have already been applied to the optics field. 
Especially, CGH calculations [151 EH El E] an d reconstruction calculations 
in digital holography [181 ES] are used to accelerate the calculation. In 2007, 
NVIDIA released a new architecture of GPU and its software development 
environment, CUDA (Compute Unified Device Architecture). Using CUDA 
allows us to program GPU easier than prior software developments, such 
as HLSL, Cg language and so forth. Since the release, many papers using 
NVIDIA GPU and CUDA have been published in optics. 

On the other hand, more recently in December 2009, a new GPU of the 
HD5000 series (RV870) made by AMD was released. The RV870 GPU has 
new architecture and its software environment, OpenCL (Open Computing 
Language). The architecture of the RV870 GPU is different from that of 
the NVIDIA GPU. The RV870 GPU has huge potential for fast calculation 
because one GPU chip has over 1,000 floating-point number processors, while 
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one NVIDIA GPU chip has about 200 floating-point number processors. 
However, fast CGH calculation using the RV870 GPU has not been reported 
so far. 

In this paper, we report fast CGH calculation using RV870 GPU and 
OpenCL. Using these, we can calculate 1,920 x 1,024 resolution of a CGH 
from a 3D object consisting of 1, 024 points in 30 milli-seconds. To the best of 
our knowledge, this article is the first report of using the RV870 GPU and 
OpenCL in optics. In addition, we compare the calculation performance 
between the RV870 GPU and the GPU made by NVIDIA. 

In Section 2, we describe a fast CGH calculation on AMD RV870 and 
OpenCL. In Section 3, we show and compare the performance between the 
RV870 GPU and the GPU made by NVIDIA. In Section 4, we conclude this 
work. 

2 Fast calculation of computer-generated-hologram 
on AMD RV870 and OpenCL 




Figure 1: Architecture of RV870 GPU. (a) Outline of RV870 GPU chip (b) 
Thread processor. 

The architecture of RV870 GPU is shown in FigHJ The top level of the 
GPU consists of many SIMD (Single Instruction Multiple Data) engines. 
The SIMD engine has 16 thread processors (TP) and a shared memory, 
which is small and high-speed. 

In addition, the thread processor has four stream cores and one T-stream 
core. The stream core is a simple floating-point-number operation unit. 
And, the T-stream core also has a floating-point-number operation unit 
and special function unit. The special function unit can calculate special 
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functions at high speed, such as trigonometric function, logarithm function 
and so on. The stream cores in the same SIMD engine operate by the same 
instructions; therefore, the SIMD engine is similar to a SIMD processor. 

Calculation on the GPU using OpenCL is executed using the following 
steps: (1) We initialize a GPU using OpenCL API (Application Program 
Interface) functions. (2) We allocate the required amount of memory on 
a device memory in Fig[TJ The device memory is large amount, but large 
latency access of memory. (3) We send an input data to the device memory. 
(4) We send a kernel function from the host computer to the GPU. The 
kernel function is compiled to native code of GPU using the OpenCL com- 
piler. The GPU executes the kernel function. (5) We receive a calculated 
result from the device memory. (6) We release the device memory and GPU 
resources. 

Figure [2ja) shows the outline of the CGH calculation on the RV870 
GPU with OpenCL. When calculating a CGH with the resolution of N x x 
N y , we need to divide the CGH area into groups with the size of T x x T y . 
Therefore, the number of groups is N x /T x x N y /T y . In addition, each group 
has T x x T y items (In the CUDA, group and item are equivalent to block 
and thread, respectively). Each group is allocated to SIMD engines and 
each item simultaneously calculate Eq.([T]) by each stream core on an SIMD 
engine. 

In Figj2^b), we show the kernel source code of the CGH calculation on 
the RV870 GPU with OpenCL. The source code is not optimized because 
we understand it easily. The optimization is shown in the next subsection. 

Each group and item have the indices, group_id and locaLid. The 
OpenCL functions, get_group_id(0) and get_group_id(l), give us the hori- 
zontal and vertical indices of groupjds respectively. The OpenCL functions, 
get_local_id(0) and get_local_id(l), also give us the horizontal and vertical 
indices of locaLids respectively. 

The arguments of the kernel function are a CGH data {dJiol), an object 
data (d-obj), the number of object points (N) and the CGH size (N x ,N y ). 
An object data (d-obj) consists of the coordinates and the intensity as four 
float data (floatA). In lines 5, 6 and 7 of the Figl^b), the variables x and 
y calculate the coordinates (xh,Uh) on the CGH plane and adr calculates 
the address of the device memory for storing the calculation result I{xh,Vh)- 
In lines 11 to 16, a CGH point I(xh,Vh) can be calculated by iterating for 
N. Although seeming to execute only one kernel, in fact, each stream core 
corresponding to local Ad and global-id can perform the kernel in parallel. 

When calculating a CGH with 1, 920 x 1, 024 from a 3D object composed 
of 1, 024 points, the kernel with T x x T y = 16 x 16 took about 215ms. The 



4 



calculation speed of the kernel is slow. 
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9 CGH po\niI(x h ,y h ) 



kernel void Kernel{ 

global float * d hoi, global float4 * d_obj, 

intN, intNx, int Ny) 



igned int x = get_group_id(0} * getJocalsize(O) + get local id(O) ; 
igned int y = get group idf 1 } * get_local_size(1 ) + get_local_id(1 ) ; 
gned int adr-y*Nx+x; 



float 1=0.0; 

fbrp=0;I<N; I++J { 

float dx=x-d_obj[i].x; 
float dy-y-d_obj[i].y; 
float dz=d_obj[i].z; 

l+=d_obj[i].w * cos((dx*dx+dy*dy)*dz); 



d_hol[adr]=l; 



group _id(\) 



Figure 2: (a) Outline of CGH calculation on RV870 GPU and OpenCL. (b) 
Kernel for the CGH calculation using OpenCL without optimization. 



2.1 Optimization 

The previous source code is not optimized. In this section, we optimize the 
previous source code to obtain more acceleration speed. Figure [3] shows the 
optimized kernel function from Figj2]^b). 

For more acceleration, we applied the following optimization techniques 
to the previous kernel: (1) Recurrence algorithm (2) Shared memory (3) 
Loop unrolling (4) Vectorization (5) Native instruction. 

We proposed a fast CGH computation method using two recurrence for- 
mulas [H [T2"l [I~3"l 114) . Our recurrence algorithm can compute the phase 
component of the cosine function in Eq.([T]) by two recurrence formulas. The 
recurrence algorithm is as follows: 



N 

I(x h + n, y h ) = ^ A;cos(r n ) r n = T n _i + 5 n -i S n = 5 n -\ + A (2) 

3 

Here, we define r = Pj((x h - xj) 2 + (y h - y,-) 2 ), 5 = Pj(2(x h - xj) + 1), 
A = 2Pj. Eventually, we can compute the phase T n at the next coordinate 
by the two recurrence formulas. For more details, sec Rcf. [8 J 

In lines 15 to 18, we copy the object data from the device memory 
(d_obj) to a shared memory (s-obj). The shared memory can store 256 
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object points at a time because the shared memory is small and high- 
speed. Therefore, in the 13, we must iterate N/256 times. Note that 
barrier(C LK _LOC AL_M EM _F EN C E) means a barrier synchronization 
in line 18. It is equivalent to the syncthreads function in the CUDA. 

Loop unrolling is a well-known technique for optimizing a kernel function. 
It can be realized by reducing the number of iterations and replicating the 
body of the loop. Benefits of the loop unrolling are the capable to decrease 
the loop frequency, branch instructions and conditional instructions. In the 
optimized kernel, we applied the loop unrolling to the loop of object points. 
In lines 20 to 51 in Figj3l we can perform four object points per one iteration 
of the loop. In addition, we vectorize the operations in the loop using the 
floats type, in order to handle four object points at a time. For example, in 
line 22, we can calculate the four subtractions simultaneously. In the same 
way, the kernel can handle eight CGH points using the floats type at same 
time in lines 40 to 50. 

In lines 42 to 45, we used native cosine functions, instead of the normal 
cosine function shown in Figj2]Jb). The native cosine function can compute 
the fast cosine function using the hardware. 



1 : _kernel raid template_Kernel( 28: delta=(f )2.0Fdz[0]; 

2: global floats* d_hol, _global floats d_obj, 29: float4 theta_xy=( )(dx[0]*dx[0]+dy[0]*dy[0])*dz[0]; 

3: int N, int Nx, mt Ny) 3o : 

4:{ 31: float4 gO=theta_xy; 

5: _local float4 s_obj[256]; 32: float4 g1=g0+gamma_1 ; 

6: int tx = get_local_id(0); 33: float4 d1=gamma_1 +delta; 

7: intty-getjocaljd(l); 34: float4 g2=g1+d1; 

8: int X = (get_groupjd(0)*getjocal_size(0) + getjocal_id(0)} 4 8; 35: d2=d1+delta; 

9: int y = get_group_id{1 }*get_local_size(1 ) + get_local_id(1); 36: ! 

10: int adr=(y*Nx+x)/8; 37: float4 g7=g6+d6; 

11: float8 I=(float8)0.0f; 38: d7=d6+delta; 

12: 39: 

13: for (int a = 0; a < N; a += 256) { 40: float-4 cs[81; 

14: 41: 

15: s_adr1=ty*getjocal_size(0)+tx; 42: cs[0]=(native_cos(g0)); 

16: S_adr2=s_adr1+a; 43: cs[1]=(native_cos(g1)); 

17: s_obj[s_adr1]-d_obj[s_adr2]; 44- 

18: barrier(CLK_LOCAL_MEM_FENCE): 4 5: cs[7]=(native_cos(g7)); 

19: 46: 

20: for( i=0; i<256; i+=4){ 47: I.s0+=a[0].x*cs[0].x+a[0].y'cs[0].y+a[0].z"cs[0].z+a[0].w*cs[0].w; 

21: float4 a[4], dx[4],dy[4], dz[4] ; 48: I.s1+=a[0].x*cs[1].x+a[0].y*cs[1].y+a[0].z*cs[1].z+a[0].w*cs[1].w; 

22: dx[0]=s_obj[i+0]-( )x; 49: 

23: dy[0]=s_obj[i+1]-(float4)y; 50: I.s7+=a[0].x*cs[7].x+a[0].y*cs[7].y+a[0].z*cs[7].z+a[0].w*cs[7].w; 

24: dz[0]=s_obj[i+2]; 5 1 : } 

25: a[0] =s_obj[i+3]; 52: barrier(CLKLOCALMEMFENCE); 

26: 53: } 

27: float4gamma_1=((float4)2.0f*dx[0]+(float4)1.0f)*dz[0]; 54: d_hol[adr]=( )(I.s7,I.s6,I.s5,I.s4,I.s3,I.s2,I.s1,I.s0); 

55: } 



Figure 3: Kernel for the CGH calculation using OpenCL with optimization. 



2.2 Results 

Table [T] shows a comparison of the calculation times for a CPU alone, 
NVIDIA GPU and an AMD RV870 GPU. The size of the CGH is 1, 920 x 
1, 024. The specifications of the personal computer are as follows: Intel Core 
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2 Quad Q6600 (We used one core for the calculation), 2 GB of memory, Mi- 
crosoft Windows XP SP3. We used a GeForce GTX260 as the NVIDIA GPU 
board and its software development environment of CUDA version 2.3, and 
a RADEON HD5850 as the AMD GPU board and its software development 
environment of StreamSDK version2.0. The RADEON HD5850 GPU has 
1,440 stream cores (namely, 18 SIMD engines) with the clock frequency of 
725MHz. 

We can see that the optimization method for the AMD GPU described 
in Section \2. II can perform more than ten times faster than that without the 
optimization. In the calculation times for the NVIDIA GPU in the table, 
we optimized the kernel for the NVIDIA GPU using the same method as 
described in Section [2711 namely, recurrence algorithm, shared memory, loop 
unrolling, vectorization, native instruction. And, in the calculation times for 
the CPU alone in the table, we used Eq.([2]) for the CGH calculation. All 
calculation times using AMD and NVIDIA are superior to those using the 
CPU alone. In addition, the AMD GPU can calculate a CGH approximately 
two times faster than the NVIDIA GPU. 



Table 1: Comparison of calculation times on CPU alone, NVIDIA GPU and 
AMD GPU. 



Number of object points 


Time(ms) 


CPU 


NVIDIA GPU 


AMD GPU 


Without optimization 


With optimization 


512 


30 x 10 3 


33 


215 


19 


1024 


59 x 10 3 


59 


422 


31 


1536 


88 x 10 3 


85 


630 


44 


2048 


115 x 10 3 


112 


838 


56 


2560 


146 x 10 3 


139 


1045 


68 


3072 


174 x 10 3 


165 


1252 


80 


3584 


204 x 10 3 


192 


1464 


93 



3 Conclusion 

In this paper, we described a fast CGH calculation using an AMD RV870 
GPU with new architecture and its new software development environment, 
OpenCL. Many fast CGH calculation methods using a NVIDIA GPU and 
the CUDA have already been reported in optics field; however, a study using 
the RV870 GPU has not been reported so far. To the best of our knowledge, 
this article is the first report of using the RV870 GPU and OpenCL in 
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optics. Using the RV870 GPU and OpenCL, we can calculate 1, 920 x 1, 024 
resolution of a CGH from a 3D object consisting of 1, 024 points in about 30 
ms. The calculation speed can realize approximately two times faster than 
the NVIDIA GPU. 

This research was partially supported by the Ministry of Internal Affairs 
and Communications, Strategic Information and Communications R&D Pro- 
motion Programme (SCOPE), 2009, and Japan Society for the Promotion 
of Science (JSPS), Grant-in- Aid for Scientific Research (C) (21500094). 

References 

[1] S. A. Benton, "Experiments in holographic video imaging," Proc.SPIE 
IS8, 247-267 (1991). 

[2] K. Maeno, N. Fukaya, O. Nishikawa, K. Sato and T. Honda, 
"ELECTRO-HOLOGRAPHIC display using 15MEGA pixels LCD," 
Proc.SPIE 2652, 15-13 (1996). 

[3] C. W. Slinger et al., "Recent Developments in Computer-Generated 
Holography: Toward a Practical Electroholography System for Interac- 
tive 3D Visualization," Proc SPIE., 5290, 27-41 (2004). 

[4] J. Hahn, H. Kim, Y. Lim, G. Park and B. Lee, "Wide viewing angle 
dynamic holographic stereogram with a curved array of spatial light 
modulators," Opt. Express, 16, 12372-12386 (2008). 

[5] Y. Takaki and N. Okada, "Hologram generation by horizontal scanning 
of a high-speed spatial light modulator," Appl. Opt., 48, 3256-3261 
(2009). 

[6] H.Yoshikawa, "Fast Computation of Fresnel Holograms Employing Dif- 
ference," Opt.Rev., 8, 331 (2000). 

[7] K. Matsushima and M. Takai, "Recurrence Formulas for Fast Cre- 
ation of Synthetic Three-Dimensional Holograms," Appl. Opt., 39, 
6587 (2000). 

[8] T. Shimobaba and T. Ito, "Special-purpose computer for holography 
HORN-4 with recurrence algorithm," Comp. Phys. Commun., 148, 
160-170 (2002). 

[9] M. Lucente, "Interactive Computation of holograms using a Look-up 
Table," J. Electron. Imaging, 2, 28-34 (1993). 



8 



[10] H. Yoshikawa, T. Yamaguchi, and R. Kitayama, "Real-Time Gener- 
ation of Full color Image Hologram with Compact Distance Look- 
up Table," OSA Topical Meeting on Digital Holography and Three- 
Dimensional Imaging 2009, DWC4 (2009). 

[11] Y. Pan, X. Xu, S. Solanki, X. Liang, R. Bin A. Tanjung, C. Tan, and 
T. C. Chong," Fast CGH computation using S-LUT on GPU," Opt. 
Express, 17, 18543-18555 (2009). 

[12] T. Shimobaba, A. Shiraki, N. Masuda, and T. Ito, "Electroholographic 
display unit for three-dimensional display by use of special-purpose 
computational chip for holography and reflective LCD panel," Opt. 
Express, 13, 4196-4201 (2005). 

[13] T. Ito, et.al., "Special-purpose computer HORN-5 for a real-time elec- 
troholography," Opt. Express, 13, 1923-1932 (2005). 

[14] Y. Ichihashi, H. Nakayama, T. Ito, N. Masuda, T. Shimobaba, A. Shi- 
raki and T. Sugie "HORN-6 special-purpose clustered computing sys- 
tem for electroholography," Opt. Express, 17, 13895-13903 (2009). 

[15] N. Masuda, T. Ito, T. Tanaka, A. Shiraki and T. Sugie, "Computer 
generated holography using a graphics processing unit," Opt. Express, 
14, 2, pp.587-592 (2008). 

[16] L. Ahrenberg, P. Benzie, M. Magnor, J. Watson, "Computer gener- 
ated holography using parallel commodity graphics hardware," Opt. 
Express, 14, 17, pp.7636-7641 (2006). 

[17] H. Kang, F. Yaras, and L. Onural, "Graphics processing unit acceler- 
ated computation of digital holograms," Appl. Opt., 48, H137-H143 
(2009). 

[18] T. Shimobaba, et.al., "Numerical calculation library for diffraction in- 
tegrals using the graphic processing unit : the GWO library," J. Opt. 
A: Pure Appl. Opt., 10, 075308 (5pp) (2008). 

[19] T. Shimobaba, Y. Sato, J. Miura, M. Takenouchi and T. Ito, "Real- 
time digital holographic microscopy using the graphic processing unit," 
Opt. Express, 16, 11776-11781 (2008) 



9 



